feat: support array_contains in LabelList scalar index#5681
feat: support array_contains in LabelList scalar index#5681westonpace merged 9 commits intolance-format:mainfrom
Conversation
|
PTAL @westonpace. |
|
Discovered a corner data correctness bug in the LABEL_LIST index and filed #5682 for follow-up. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
westonpace
left a comment
There was a problem hiding this comment.
Looks good. Thanks for picking this up!
Will merge in a day or so in case you want to address any comments.
| # Include lists with NULL items to ensure NULL needle behavior matches | ||
| # non-index execution. | ||
| tbl = pa.table( | ||
| {"labels": [["foo", "bar"], ["bar"], ["baz"], ["qux", None], [None], []]} |
There was a problem hiding this comment.
Maybe also include an entry where the entire list is None?
| // Do not push down NULL needles. | ||
| if scalar.is_null() { | ||
| return None; | ||
| } |
There was a problem hiding this comment.
Actually, this is independent of #5682. #5682 is about the index missing rows where a valid element and a NULL co-exists (e.g., searching for 'foo' misses ['foo', None]). I will provide more details in #5682 later.
I've added a comment to explain the check. And array_has_any and array_has_all don't have this semantic mismatch as they natively follow membership logic.
| } | ||
| match expr { | ||
| Expr::Between(between) => Ok(visit_between(between, index_info)), | ||
| Expr::Alias(alias) => visit_node(alias.expr.as_ref(), index_info, depth), |
| if args.len() != 2 { | ||
| return None; | ||
| } | ||
| if func.name() == "array_has" { |
There was a problem hiding this comment.
How does this work for array_contains? Is Datafusion mapping that to array_has already?
If so, can we add a comment here mentioning that this branch is also going to be hit for array_contains?
|
The build-no-lock failure should have been fixed by #5690 |
04d8744 to
33558d5
Compare
|
PTAL @westonpace. |
|
Thanks for picking this up! |
…5681) closes lance-format#3687 Changes: - Make array_contains(list_col, value) use the existing LABEL_LIST scalar index. - DataFusion treats array_contains as an alias of array_has (often wrapped in an alias expr), so we unwrap that and map it to the LabelList index query. - Add a Python test and update the LabelList docs to mention array_has / array_contains.
…5681) closes lance-format#3687 Changes: - Make array_contains(list_col, value) use the existing LABEL_LIST scalar index. - DataFusion treats array_contains as an alias of array_has (often wrapped in an alias expr), so we unwrap that and map it to the LabelList index query. - Add a Python test and update the LabelList docs to mention array_has / array_contains.
closes #3687
Changes: